Improved Hybrid Binarization based on Kmeans for Heterogeneous document processing
Internal identifier: 000004 (France/Analysis); previous: 000003; next: 000005
Authors: Mahmoud Soua [France]; Rostom Kachouri [France]; Mohamed Akil [France]
English descriptors
Binarization; Gamma Correction; HBK; Heterogeneous Documents; OCR
Abstract
Nowadays, more and more scanned documents are converted into editable electronic representations. This process relies on the Optical Character Recognition (OCR) tool-chain. Generally, an OCR system depends on a binarization step that separates character strokes from the document background. In this context, one of the more robust binarization methods is the recently proposed Hybrid Binarization based on Kmeans (HBK). It effectively handles scanned documents that contain text on a simple background. Nevertheless, in heterogeneous documents, HBK runs into problems when extracting foreground text from complex background images. Moreover, HBK assumes a dark foreground against a light background; otherwise, it fails to render the correct binarization colors. In this paper, we propose to improve the HBK method so that it handles heterogeneous documents efficiently. Our proposal employs a layout analysis process that classifies document regions into text and image. Image regions are enhanced with Gamma Correction (GC) before HBK binarization. Text regions are processed directly with HBK, preserving its effectiveness on text with a homogeneous background. To ensure robust and independent color rendering in the binarized documents, we control the labeling polarity of text and background through a pixel density-based technique. Our experiments on the LRDE and ICDAR datasets show that the improved method (I-HBK) outperforms HBK on heterogeneous documents in both F-measure and OCR accuracy.
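The three ingredients named in the abstract (gamma correction, k-means-based binarization, and density-based polarity control) can be illustrated with a minimal Python sketch. This is not the authors' HBK or I-HBK implementation: the layout analysis step is omitted, the clustering is a plain two-cluster k-means on gray levels, and the polarity rule (text is assumed to cover fewer pixels than background) is an assumption made for illustration. The function names and the `gamma` parameter are hypothetical.

```python
def gamma_correct(pixels, gamma):
    """Apply power-law (gamma) correction to 8-bit grayscale values."""
    return [[round(255 * (p / 255) ** gamma) for p in row] for row in pixels]


def two_means_threshold(values, iterations=50):
    """1-D k-means with k=2; returns the midpoint between the two centroids."""
    lo, hi = float(min(values)), float(max(values))
    for _ in range(iterations):
        mid = (lo + hi) / 2
        # In 1-D with two centroids, nearest-centroid assignment is a split at the midpoint.
        dark = [v for v in values if v <= mid]
        light = [v for v in values if v > mid]
        if not dark or not light:
            break
        new_lo, new_hi = sum(dark) / len(dark), sum(light) / len(light)
        if (new_lo, new_hi) == (lo, hi):
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


def binarize(pixels, gamma=None):
    """Return a 0/1 mask where 1 marks presumed text pixels."""
    if gamma is not None:
        # Assumed pre-processing for image-like regions only.
        pixels = gamma_correct(pixels, gamma)
    flat = [p for row in pixels for p in row]
    threshold = two_means_threshold(flat)
    dark_mask = [[1 if p <= threshold else 0 for p in row] for row in pixels]
    # Polarity control (assumption): the less dense class is taken to be text.
    dark_count = sum(map(sum, dark_mask))
    if dark_count > len(flat) / 2:
        return [[1 - b for b in row] for row in dark_mask]
    return dark_mask


if __name__ == "__main__":
    dark_on_light = [[200, 200, 200, 200], [200, 30, 30, 200], [200, 200, 200, 200]]
    light_on_dark = [[30, 30, 30, 30], [30, 220, 220, 30], [30, 30, 30, 30]]
    print(binarize(dark_on_light))
    print(binarize(light_on_dark))
```

With the density rule, both toy inputs yield the same text mask regardless of whether the text is darker or lighter than its background, which is the behavior the polarity control is meant to provide.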
Url: https://hal-upec-upem.archives-ouvertes.fr/hal-01309993
DOI: 10.1109/ISPA.2015.7306060
Affiliations:
Links toward previous steps (curation, corpus...)
- to stream Hal, to step Corpus: 000063
- to stream Hal, to step Curation: 000063
- to stream Hal, to step Checkpoint: 000004
- to stream Main, to step Merge: 000016
- to stream Main, to step Curation: 000016
- to stream Main, to step Exploration: 000016
- to stream France, to step Extraction: 000004
Links to Exploration step
Hal:hal-01309993
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Improved Hybrid Binarization based on Kmeans for Heterogeneous document processing</title>
<author><name sortKey="Soua, Mahmoud" sort="Soua, Mahmoud" uniqKey="Soua M" first="Mahmoud" last="Soua">Mahmoud Soua</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-3210" status="VALID"><idno type="RNSR">200212717U</idno>
<orgName>Laboratoire d'Informatique Gaspard-Monge</orgName>
<orgName type="acronym">LIGM</orgName>
<desc><address><addrLine>Université de Paris-Est - Marne-la-Vallée, Cité Descartes, Bâtiment Copernic, 5 bd Descartes, 77454 Marne-la-Vallée Cedex 2, Inst Gaspard Monge</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://ligm.u-pem.fr</ref>
</desc>
<listRelation><relation active="#struct-301243" type="direct"></relation>
<relation active="#struct-301545" type="direct"></relation>
<relation active="#struct-302085" type="direct"></relation>
<relation active="#struct-304949" type="direct"></relation>
<relation name="UMR8049" active="#struct-441569" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-301243" type="direct"><org type="institution" xml:id="struct-301243" status="VALID"><orgName>Université Paris-Est Marne-la-Vallée</orgName>
<orgName type="acronym">UPEM</orgName>
<desc><address><addrLine>5 boulevard Descartes - Champs-sur-Marne - 77454 Marne-la-Vallée Cedex2 </addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.u-pem.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-301545" type="direct"><org type="institution" xml:id="struct-301545" status="OLD"><orgName>École des Ponts ParisTech (ENPC)</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-302085" type="direct"><org type="institution" xml:id="struct-302085" status="VALID"><orgName>Fédération de Recherche Bézout</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-304949" type="direct"><org type="institution" xml:id="struct-304949" status="INCOMING"><orgName>ESIEE</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="UMR8049" active="#struct-441569" type="direct"><org type="institution" xml:id="struct-441569" status="VALID"><idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc><address><country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Kachouri, Rostom" sort="Kachouri, Rostom" uniqKey="Kachouri R" first="Rostom" last="Kachouri">Rostom Kachouri</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-3210" status="VALID"><idno type="RNSR">200212717U</idno>
<orgName>Laboratoire d'Informatique Gaspard-Monge</orgName>
<orgName type="acronym">LIGM</orgName>
<desc><address><addrLine>Université de Paris-Est - Marne-la-Vallée, Cité Descartes, Bâtiment Copernic, 5 bd Descartes, 77454 Marne-la-Vallée Cedex 2, Inst Gaspard Monge</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://ligm.u-pem.fr</ref>
</desc>
<listRelation><relation active="#struct-301243" type="direct"></relation>
<relation active="#struct-301545" type="direct"></relation>
<relation active="#struct-302085" type="direct"></relation>
<relation active="#struct-304949" type="direct"></relation>
<relation name="UMR8049" active="#struct-441569" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-301243" type="direct"><org type="institution" xml:id="struct-301243" status="VALID"><orgName>Université Paris-Est Marne-la-Vallée</orgName>
<orgName type="acronym">UPEM</orgName>
<desc><address><addrLine>5 boulevard Descartes - Champs-sur-Marne - 77454 Marne-la-Vallée Cedex2 </addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.u-pem.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-301545" type="direct"><org type="institution" xml:id="struct-301545" status="OLD"><orgName>École des Ponts ParisTech (ENPC)</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-302085" type="direct"><org type="institution" xml:id="struct-302085" status="VALID"><orgName>Fédération de Recherche Bézout</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-304949" type="direct"><org type="institution" xml:id="struct-304949" status="INCOMING"><orgName>ESIEE</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="UMR8049" active="#struct-441569" type="direct"><org type="institution" xml:id="struct-441569" status="VALID"><idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc><address><country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Akil, Mohamed" sort="Akil, Mohamed" uniqKey="Akil M" first="Mohamed" last="Akil">Mohamed Akil</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-3210" status="VALID"><idno type="RNSR">200212717U</idno>
<orgName>Laboratoire d'Informatique Gaspard-Monge</orgName>
<orgName type="acronym">LIGM</orgName>
<desc><address><addrLine>Université de Paris-Est - Marne-la-Vallée, Cité Descartes, Bâtiment Copernic, 5 bd Descartes, 77454 Marne-la-Vallée Cedex 2, Inst Gaspard Monge</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://ligm.u-pem.fr</ref>
</desc>
<listRelation><relation active="#struct-301243" type="direct"></relation>
<relation active="#struct-301545" type="direct"></relation>
<relation active="#struct-302085" type="direct"></relation>
<relation active="#struct-304949" type="direct"></relation>
<relation name="UMR8049" active="#struct-441569" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-301243" type="direct"><org type="institution" xml:id="struct-301243" status="VALID"><orgName>Université Paris-Est Marne-la-Vallée</orgName>
<orgName type="acronym">UPEM</orgName>
<desc><address><addrLine>5 boulevard Descartes - Champs-sur-Marne - 77454 Marne-la-Vallée Cedex2 </addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.u-pem.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-301545" type="direct"><org type="institution" xml:id="struct-301545" status="OLD"><orgName>École des Ponts ParisTech (ENPC)</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-302085" type="direct"><org type="institution" xml:id="struct-302085" status="VALID"><orgName>Fédération de Recherche Bézout</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-304949" type="direct"><org type="institution" xml:id="struct-304949" status="INCOMING"><orgName>ESIEE</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="UMR8049" active="#struct-441569" type="direct"><org type="institution" xml:id="struct-441569" status="VALID"><idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc><address><country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-01309993</idno>
<idno type="halId">hal-01309993</idno>
<idno type="halUri">https://hal-upec-upem.archives-ouvertes.fr/hal-01309993</idno>
<idno type="url">https://hal-upec-upem.archives-ouvertes.fr/hal-01309993</idno>
<idno type="doi">10.1109/ISPA.2015.7306060</idno>
<date when="2015-09-07">2015-09-07</date>
<idno type="wicri:Area/Hal/Corpus">000063</idno>
<idno type="wicri:Area/Hal/Curation">000063</idno>
<idno type="wicri:Area/Hal/Checkpoint">000004</idno>
<idno type="wicri:Area/Main/Merge">000016</idno>
<idno type="wicri:Area/Main/Curation">000016</idno>
<idno type="wicri:Area/Main/Exploration">000016</idno>
<idno type="wicri:Area/France/Extraction">000004</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">Improved Hybrid Binarization based on Kmeans for Heterogeneous document processing</title>
<author><name sortKey="Soua, Mahmoud" sort="Soua, Mahmoud" uniqKey="Soua M" first="Mahmoud" last="Soua">Mahmoud Soua</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-3210" status="VALID"><idno type="RNSR">200212717U</idno>
<orgName>Laboratoire d'Informatique Gaspard-Monge</orgName>
<orgName type="acronym">LIGM</orgName>
<desc><address><addrLine>Université de Paris-Est - Marne-la-Vallée, Cité Descartes, Bâtiment Copernic, 5 bd Descartes, 77454 Marne-la-Vallée Cedex 2, Inst Gaspard Monge</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://ligm.u-pem.fr</ref>
</desc>
<listRelation><relation active="#struct-301243" type="direct"></relation>
<relation active="#struct-301545" type="direct"></relation>
<relation active="#struct-302085" type="direct"></relation>
<relation active="#struct-304949" type="direct"></relation>
<relation name="UMR8049" active="#struct-441569" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-301243" type="direct"><org type="institution" xml:id="struct-301243" status="VALID"><orgName>Université Paris-Est Marne-la-Vallée</orgName>
<orgName type="acronym">UPEM</orgName>
<desc><address><addrLine>5 boulevard Descartes - Champs-sur-Marne - 77454 Marne-la-Vallée Cedex2 </addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.u-pem.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-301545" type="direct"><org type="institution" xml:id="struct-301545" status="OLD"><orgName>École des Ponts ParisTech (ENPC)</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-302085" type="direct"><org type="institution" xml:id="struct-302085" status="VALID"><orgName>Fédération de Recherche Bézout</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-304949" type="direct"><org type="institution" xml:id="struct-304949" status="INCOMING"><orgName>ESIEE</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="UMR8049" active="#struct-441569" type="direct"><org type="institution" xml:id="struct-441569" status="VALID"><idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc><address><country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Kachouri, Rostom" sort="Kachouri, Rostom" uniqKey="Kachouri R" first="Rostom" last="Kachouri">Rostom Kachouri</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-3210" status="VALID"><idno type="RNSR">200212717U</idno>
<orgName>Laboratoire d'Informatique Gaspard-Monge</orgName>
<orgName type="acronym">LIGM</orgName>
<desc><address><addrLine>Université de Paris-Est - Marne-la-Vallée, Cité Descartes, Bâtiment Copernic, 5 bd Descartes, 77454 Marne-la-Vallée Cedex 2, Inst Gaspard Monge</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://ligm.u-pem.fr</ref>
</desc>
<listRelation><relation active="#struct-301243" type="direct"></relation>
<relation active="#struct-301545" type="direct"></relation>
<relation active="#struct-302085" type="direct"></relation>
<relation active="#struct-304949" type="direct"></relation>
<relation name="UMR8049" active="#struct-441569" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-301243" type="direct"><org type="institution" xml:id="struct-301243" status="VALID"><orgName>Université Paris-Est Marne-la-Vallée</orgName>
<orgName type="acronym">UPEM</orgName>
<desc><address><addrLine>5 boulevard Descartes - Champs-sur-Marne - 77454 Marne-la-Vallée Cedex2 </addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.u-pem.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-301545" type="direct"><org type="institution" xml:id="struct-301545" status="OLD"><orgName>École des Ponts ParisTech (ENPC)</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-302085" type="direct"><org type="institution" xml:id="struct-302085" status="VALID"><orgName>Fédération de Recherche Bézout</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-304949" type="direct"><org type="institution" xml:id="struct-304949" status="INCOMING"><orgName>ESIEE</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="UMR8049" active="#struct-441569" type="direct"><org type="institution" xml:id="struct-441569" status="VALID"><idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc><address><country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Akil, Mohamed" sort="Akil, Mohamed" uniqKey="Akil M" first="Mohamed" last="Akil">Mohamed Akil</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-3210" status="VALID"><idno type="RNSR">200212717U</idno>
<orgName>Laboratoire d'Informatique Gaspard-Monge</orgName>
<orgName type="acronym">LIGM</orgName>
<desc><address><addrLine>Université de Paris-Est - Marne-la-Vallée, Cité Descartes, Bâtiment Copernic, 5 bd Descartes, 77454 Marne-la-Vallée Cedex 2, Inst Gaspard Monge</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://ligm.u-pem.fr</ref>
</desc>
<listRelation><relation active="#struct-301243" type="direct"></relation>
<relation active="#struct-301545" type="direct"></relation>
<relation active="#struct-302085" type="direct"></relation>
<relation active="#struct-304949" type="direct"></relation>
<relation name="UMR8049" active="#struct-441569" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-301243" type="direct"><org type="institution" xml:id="struct-301243" status="VALID"><orgName>Université Paris-Est Marne-la-Vallée</orgName>
<orgName type="acronym">UPEM</orgName>
<desc><address><addrLine>5 boulevard Descartes - Champs-sur-Marne - 77454 Marne-la-Vallée Cedex2 </addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.u-pem.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-301545" type="direct"><org type="institution" xml:id="struct-301545" status="OLD"><orgName>École des Ponts ParisTech (ENPC)</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-302085" type="direct"><org type="institution" xml:id="struct-302085" status="VALID"><orgName>Fédération de Recherche Bézout</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-304949" type="direct"><org type="institution" xml:id="struct-304949" status="INCOMING"><orgName>ESIEE</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="UMR8049" active="#struct-441569" type="direct"><org type="institution" xml:id="struct-441569" status="VALID"><idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc><address><country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
</analytic>
<idno type="DOI">10.1109/ISPA.2015.7306060</idno>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="mix" xml:lang="en"><term>Binarization</term>
<term>Gamma Correction</term>
<term>HBK</term>
<term>Heterogeneous Documents</term>
<term>OCR</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Nowadays, more and more scanned documents are converted into editable electronic representations. This process relies on the Optical Character Recognition (OCR) tool-chain. Generally, an OCR system depends on a binarization step that separates character strokes from the document background. In this context, one of the more robust binarization methods is the recently proposed Hybrid Binarization based on Kmeans (HBK). It effectively handles scanned documents that contain text on a simple background. Nevertheless, in heterogeneous documents, HBK runs into problems when extracting foreground text from complex background images. Moreover, HBK assumes a dark foreground against a light background; otherwise, it fails to render the correct binarization colors. In this paper, we propose to improve the HBK method so that it handles heterogeneous documents efficiently. Our proposal employs a layout analysis process that classifies document regions into text and image. Image regions are enhanced with Gamma Correction (GC) before HBK binarization. Text regions are processed directly with HBK, preserving its effectiveness on text with a homogeneous background. To ensure robust and independent color rendering in the binarized documents, we control the labeling polarity of text and background through a pixel density-based technique. Our experiments on the LRDE and ICDAR datasets show that the improved method (I-HBK) outperforms HBK on heterogeneous documents in both F-measure and OCR accuracy.</div>
</front>
</TEI>
<affiliations><list><country><li>France</li>
</country>
</list>
<tree><country name="France"><noRegion><name sortKey="Soua, Mahmoud" sort="Soua, Mahmoud" uniqKey="Soua M" first="Mahmoud" last="Soua">Mahmoud Soua</name>
</noRegion>
<name sortKey="Akil, Mohamed" sort="Akil, Mohamed" uniqKey="Akil M" first="Mohamed" last="Akil">Mohamed Akil</name>
<name sortKey="Kachouri, Rostom" sort="Kachouri, Rostom" uniqKey="Kachouri R" first="Rostom" last="Kachouri">Rostom Kachouri</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/France/Analysis
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000004 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/France/Analysis/biblio.hfd -nk 000004 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= France |étape= Analysis |type= RBID |clé= Hal:hal-01309993 |texte= Improved Hybrid Binarization based on Kmeans for Heterogeneous document processing }}
This area was generated with Dilib version V0.6.32.